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Although there have been many studies of native Korean 
cattle, Hanwoo, there have been no selective sweep studies in 
these animals. This study was performed to characterize 
genetic variation and identify selective signatures. We sequen- 
ced the genomes of 12 cattle, and identified 15125420 SNPs, 
1768114 INDELs, and 3445 CNVs. The SNPs, INDELs, and 
CNVs were similarly distributed throughout the genome, and 
highly variable regions were shown to contain the BolA family 
and GPR180, which are related to adaptive immunity. We also 
identified the domestication footprints of the Hanwoo 
population by searching for selective sweep signatures, which 
revealed the RCN2 gene related to BPV resistance. The results 
of this study may contribute to genetic improvement of the 
Hanwoo population in Korea. [BMB Reports 2013; 46(7): 
346-351] 



INTRODUCTION 

Identification of the genetic differences responsible for varia- 
tions in phenotypic traits is one of the goals of livestock ge- 
nomic research. Accordingly, it is important to characterize the 
genetic variation in livestock species. Domestication processes 
and breeding development have been imprinted in the ge- 
nomes and provide detectable clues of selection within the cat- 
tle genome (1, 2). Selective sweep signatures may have con- 
tributed to the domestication process (3, 4). Despite various 
Hanwoo (Korean native cattle) mapping and diversity studies 
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(5, 6), the identification of mutations affecting some quantita- 
tive phenotypes (7, 8), and several expression studies (9-11), se- 
lective sweep signatures in these animals have not been 
identified. There have been several attempts to conduct such 
characterization at the population level of other cattle by de- 
tecting selective sweep signatures (1, 2). Hanwoo were used as 
a draft type until about 30 years ago, whereas dairy cattle have 
been artificially selected for milk production for a long time. 
However, Hanwoo have only been bred for meat for a short 
period, and it is important to evaluate its genetic features to im- 
prove performance as beef cattle. Therefore, the present study 
was performed to characterize the entire genome sequence of 
Hanwoo Korean native cattle at the population level. 

Genetic variations, including SNPs, insertions and deletions 
(INDELs), and large structural variations, such as CNV, shape 
the genome architecture and provide signatures for identi- 
fication of variations contributing to adaptation. Large struc- 
tural variations can be attributed to inversions and trans- 
locations, duplications, and deletions. Although existing at 
lower rates than SNPs, CNVs and INDELs comprise another 
important type of genomic variation (11) that can be used to 
better understand genome features. 

Adaptation from standing genetic variation implies that there 
are neutral or weakly deleterious variations that are maintained 
with a long history in populations and that they become ad- 
vantageous with changes in the environment (12). Experimental 
evolution, which tests hypotheses or theories of evolution using 
an experimentally controlled population, has provided evi- 
dence that adaptation from standing variation is more repeat- 
able than that from new mutations (12). Standing variation 
could contribute to rapid adaptation after a sudden environ- 
mental change (13); for example, Rowan et al. showed ex- 
perimentally that sticklebacks have sufficient standing genetic 
variation to adapt to a significant change in climate in only 
three generations (14). Such adaptations are influenced by the 
type and quantity of the available genetic variation (15). 
Therefore, a comprehensive description of variations in a pop- 
ulation will lead to a deeper understanding of its biology. 

To investigate the features of the Hanwoo genome, a ge- 
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nomic scan of the Korean cattle population was performed for 
selective sweep signatures and highly variable regions were 
identified. 

RESULTS 

Identification of genetic variants 

To construct high-quality Hanwoo re-sequencing data, we gen- 
erated pair-end reads using an lllumina HiSeq2000. After re- 
moving possible PCR duplicates, 97.12% of all the reads, 
which together corresponds to mapping of 37603284033 bp, 
each individual was successfully aligned against the bovine 
reference genome UMD3.1. On average, a depth of 14.2 x 
was achieved and the mapped reads covered an average of 
98.96% of the reference genome. The depth was calculated 
for each individual using the DepthOfCoverage.jar of GATK 
(16), with the range of depth of 12.61 x-15.53x (Table S1) 
and coverage was calculated using BEDTools (1 7). 

For the UMD3.1 reference assembly, a total of 15125420 
SNPs were identified, and the change rate was 166 base pairs 
(Table S2). Of these SNPs, 27% were located in genie regions, 
while 73% were located in intergenic regions (Supplement Fig. 
2). A total of 1 7681 14 INDELs were identified, and the change 
rate was 1420 base pairs (Table S2). Of these INDELs, 30% 
were located in genie regions and 70% were located in inter- 
genic regions (Supplement Fig. 2). The minimum quality of 
SNPs and INDELs was 30. The quality of variants and INDEL 
length are shown in Supplment Fig. 6 and 7. 

In this study, CNVs were defined based only on deletion 
type using Genome STRiP (18). A total of 3445 CNVs (10.2 
Mb), with an average length of 2.97 kb (range 0.23-902 kb) 
were identified, and the change rate was 729196 base pairs 
(Table S2). Overall, 1 7% of the CNVs were located in genie re- 
gions overlapping with at least one gene, if not covering an en- 
tire gene, while 83% were located in intergenic regions with- 
out overlapping any gene (Supplement Fig. 2). Evaluation of 
the length distribution of CNVs showed that the number of 
CNVs decreased with increasing size, except for the 1-3 kb re- 
gion (Supplement Fig. 8). This exception corresponds to LINEs 
around 2,000 bp. This effect has also been identified in hu- 
mans and cattle using the same method as used in this study 
and with other methods, respectively (18, 19). These findings 
indicated that the exception is not due to the analysis method 
or sample population. 

Correlation among variants and repeat elements 

We found that SNPs, INDELS, and CNVs had similar dis- 
tributions throughout the genome, and that the number of var- 
iants was related to the length of the genome. We calculated 
the proportions of the three types of variants on each chromo- 
some and found that chromosomal proportions of each type of 
variant were similar to those of chromosomal length 
(Supplement Fig. 1a) and that the number of variants had a 
strong linear relationship with chromosomal length (Supple- 



ment Fig. 1b-d). 

The number of variants at each 1 Mb bin and 10 Mb bin 
were similarly distributed throughout the genome (Fig. 1, 
Supplement Fig. 3). CNVs were relatively less similar to the 
other variants. As there was a small number of CNVs when us- 
ing the large 10 Mb bin compared to the 1 Mb bin, the ge- 
nomic distribution of the CNVs became similar to that of the 
other types of variants (Supplement Fig. 3). 

Highly variable region 

There are common regions in which many types of variants ex- 
ist (Fig. 1). For example, the BoLA family is present in the first 
polymorphic common region, chr23:23-31 Mb, which is lo- 
cated near the centromere. The BoLA family is a type of cattle 
MHC, and has sufficient variation for immune defense. There 
are also many contigs around the 2 nd highest peak of common 
region, chrl 2:69-77 Mb, which is near the centromere. 
Specifically, GPC6, BIRC3, DCT, TCDS, GPR180, CLDN10, 
DZIP1, DNAJC3, HS6ST3, MBNL2, and RAP2A are located in 
this region. The average nucleotide diversity was as high as 
12.95 (Table S3), calculated by VCFtools (20). BoLA and 
GPR180 are located in a highly variable region, containing suf- 
ficient variants, including synonymous SNPs, non-synonymous 
SNPs, and intronic SNPs (Supplement Fig. 4A, B). However, 
R.CN2 has only a small number of variants (Supplement Fig. 
4C). SNPs and CNVs are enriched with MHC and a G pro- 
tein-coupled receptor gene region, which has also been re- 
ported in other species (21). 

A significant genome-wide correlation between the number 
of SNPs and INDELs was identified (Pearson's correlation r = 
0.87, P < 0.05) (Supplement Fig. 5). SNPs and INDELs were 
also highly correlated with CNV, while low complexity and 
simple repeats were positively correlated with CNV, and the 
GC content was negatively correlated with CNV. 

Selective sweep region 

We estimated a folded SFS because no ancestral allele in- 
formation was available. The SFS of the SNPs and INDELs 
were normal, while those of CNV showed balancing selection 
(Supplement Fig. 1 1). Linkage disequilibrium (LD)-based to sta- 
tistics and SFS-based A statistics available within the pop- 
ulation were used to identify selective sweep regions (Fig. 2). 
As expected, SweepFinder and OmegaPlus detected different 
regions and showed strong signals. The highest significance re- 
gions of A statistics were chr2:72-72.5 Mb and chr10:35.5-36 
Mb. The chr21:32-33 Mb, chr18:5. 5-10.5 kb, and 15.9 Mb re- 
gions showed clear co statistics signatures. The first significant 
A statistics region (chr2:72-72.5 Mb) included the genes 
PTPN4, EPB41L5, TMEM185B, RALB, and INHBB, while the 
genes FSIP1, CPR176, SRP14, BMP, and PAK6 were located in 
the second significant A statistics region (chrl 0:35.5-36 Mb) 
(Table 1). The first significant co statistics region (chr21:32-33 
Mb) included the genes ETFA, ISL2, RCN2, PSTPIP1, and 
TSPAN3, while the ITFG1 gene was located in the second sig- 
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nificantco statistics region (chr18:15.9 Mb) (Table 1). 

Combination statistics were used to identify common 
signals. The RCN2 gene was located in the highest combina- 
tion statistics region (chr21:32.5 Mb) between the to statistics 
and A statistics region. This gene produces a calcium-binding 
protein located in the lumen of the ER that contains six con- 
served regions with similarity to a high affinity Ca +2 -binding 
motif, the EF-hand. 

DISCUSSION 
Highly variable region 

The three types of variants were similarly distributed through- 
out the genome. Even if a type of variant was different, the 
quantity of the variant was affected by common effects, such 
as the mutation and recombination rates. These findings were 
supported by the results of a previous study in which SNP de- 
tection accuracy was shown to be affected by CNVs (19, 21). 
A significant genome-wide correlation between SNP and 
INDEL density was identified in a separate study (19). 

We identified a hypervariable region near the centromere 
and showed that the mutations were related to adaptive 
immunity. Specifically, the bovine MHC class II region lies 
near the centromere of BTA23 (Fig. 1). Variation in the high re- 
combination rate region may be due to the existence of a poly- 
morphic recombination hotspot (22-24), and regions with high 
recombination rates were significantly closer to the cen- 
tromere (25). Genetic standing variants facilitate the emer- 
gence of adaptation. The MHC region contains a diverse array 
of genes that are crucial for the initiation of adaptive immune 



responses. All types of variants were enriched with the BoLA 
(bovine MHC) family in the highest peak region and CPR 
genes involved in the interaction with extracellular molecules 
in the second highest peak region (Fig. 1). Similar findings 
have been reported in other animals, e.g., the three-spined 
stickleback has many SNP and CNV variants in the MHC re- 
gion (21), and genetic variations at the MHC loci are known to 
be involved in pathogen resistance (26). These regions are 
comprised of many contigs (Supplement Fig. 10) and the re- 
gion of BTA23 has been confirmed experimentally (22-24). 
The cattle genome build UMD 3.1 has a low amount of erro- 
neous duplication and error (27, 28); therefore, this was not 
caused by UMD 3.1 genome assembly. 

Selective sweep genes 

Generally, cattle have been marked by selection during do- 
mestication, breed formation, and ongoing selection to en- 
hance performance and productivity. We utilized two methods 
to detect genomic selection in cattle: (i) the co statistic (29, 30), 
which detects specific LD patterns caused by genetic hitchhik- 
ing, and (ii) the composite likelihood ratio (CLR) A (31) using 
the SFS, which describes the frequency of allelic variants and 
shifts from neutral expectation toward rare and high frequency 
derived variants. 

We found evidence of selective sweeps on chromosomes 2, 
10, 18, and 21 (Table 1 and Fig. 2). Among these regions, we 
identified commonly selected regions near RCN2 (E6BP) by 
considering the co and A values together, and this region had 
the highest correlation (r = 0.9983). This gene interacted with 
cancer-associated HPV E6 and with BPV-1 E6. The transforming 
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Fig. 2. Manhattan plots of selective 
sweep signature of the Hanwoo pop- 
ulation: panel (A) co statistics from 
Omega Plus; panel (B) A statistics 
(CLR) from SweepFinder. The x-axis of 
the Manhattan plot shows the ge- 
nomic position, and the y-axis repre- 
sents each statistic. The cutoff for the 
co statistics and the A statistics was 
the 99.9% quartile of empirical dis- 
tribution of each statistic. 



Table 1. Selective sweep genes of CLR ratio and Omega statistics 
(significance level: empirical P value 0.001) 



Symbol 


Location (Mb) 


CLR ratio 


PTPN4 


2:72.05-72.15 


1106.470 


EPB41L5 


2:72.17-72.33 


1190.173 


RALES 


2:72.36-72.43 


709.583 


INHBB 


2:72.52 


540.436 


FSIP1 


10:35.35-35.54 


551.626 


GPR.176 


10:35.56-35.69 


851.561 


SR.P14 


10:35.80 


456.310 


BMP 


10:35.85-35.87 


442.939 


PAK6 


10:35.98-36.01 


537.912 


Symbol 


Location (Mb) 


Omega 


ITFC1 


18:15.6-15.9 


4.519 


ETFA 


21:31.9-32 


9.607 


ISL2 


21:32-32.1 


5.7917 


R.CN2 


21:32.57-32.59 


5.291 


PSTPIP1 


21:32.64-32.69 


8.454 


TSPAN3 


21:32.7-32.72 


12.280 



activity of BPV-1 E6 mutants was correlated with their 
E6BP-binding ability (32). Calcium is required for entry into mi- 
tosis, and E6 may play a role in this stage of cell growth, in- 
dicating that R.CN2 is also important. The RCN gene has under- 
gone a selective sweep, which may suggest that Korean cattle 
are resistant to BPV (33). In support of this suggestion, ex- 
perimental evidence has been reported indicating that Korean 
native cattle have greater resistance to BPV than Holsteins (34). 

Using the co and A values, several genes were detected and 
we investigated whether this region was related with the cattle 



quantitative trait locus (QTL) except dairy cattle QTL infor- 
mation. Among the QTLs identified in the selective sweep re- 
gion, the chrl 0:35.5-36 Mb region is related to the long- 
issimus muscle area and the chrl 8:1 5.9 Mb region is related to 
carcass weight, longissimus muscle area, and social separation 
vocalization. Hanwoo was used as a draft type of cattle before 
1980, but is now used for beef production; therefore, the QTLs 
of the selective sweep region are reasonable. 

The results of the present study indicated that the dis- 
tributions of SNPs, INDELs, and CNVs have a correlated 
pattern. We found that the selective sweep signatures of the 
Hanwoo genome and the highly variable region were related 
to adaptive immunity. We hope that these characteristics will 
contribute to genetic improvement and breeding of this strain. 
Future studies should include larger samples and various 
breeds and phenotypes to obtain better understanding of cattle 
genomes. 

MATERIALS AND METHODS 

Sample preparation and re-sequencing 

Whole-blood samples were collected from 12 Hanwoo bulls 
from Kyungpook National University, Korea. Blood (10 ml) was 
drawn from the carotid artery and treated with heparin to pre- 
vent clotting. DNA was isolated from whole blood using a 
G-DEXTMIIb Genomic DNA Extraction Kit (iNtRoN Biotechno- 
logy, Seoul, Korea) according to the manufacturer's protocol. 
We randomly sheared 3 u,g of genomic DNA using the Covaris 
System to generate inserts of ~300 bp. The fragments of 
sheared DNA were end-repaired, A-tailed, adaptor ligated, and 
amplified using a TruSeq DNA Sample Prep. Kit (lllumina, San 
Diego, CA). Paired-end sequencing was conducted by NICEM 
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(National Instrumentation Center for Environmental Manage- 
ment, Seoul, Korea) using the lllumina HiSeq2000 platform with 
TruSeq SBS Kit v3-HS (lllumina). Finally, sequence data were 
generated using the lllumina HiSeq system. 

Sequence alignment and genotype calling 

Pair-end sequence reads were mapped to the reference bovine 
genome (UMD 3.1) using Bowtie 2 with the default settings 
(35). Four open-source packages were used for downstream 
processing and variant calling for SNPs and INDELs: Picard 
tools (http://picard.sourceforge.net), SAMtools 0.1.18 (36), 
VCFtools 4.0 (20), and the Genome Analysis Toolkit 1.4 (16). 
Read Group was added and duplicate reads were removed us- 
ing MarkDuplicates of Picard tools. SAMtools was used to in- 
dex the resulting bam format files and to calculate the mapped 
read length with the flagstat option (36). Realignment and var- 
iant calling were performed using GATK (16) with Count- 
Covariates, RealignerTargetCreator, IndelRealigner, Select- 
Variant, VariantFiltration, and UnifiedGenotyper. VCFtools 
was used for handling the vcf file format (20). 

Substitution calls were made with GATK UnifiedGenotyper 
(16). SNPs and INDELs called with a Phred-scaled quality 
score of less than 30 were filtered out. Variants were removed 
based on MQO (median quality score zero) >4, (MQO/read 
depth) >10%, quality depth 5, and FS (Phred-scaled P value 
using Fisher's exact test to detect strand bias) >200. After fil- 
tering, the SNPs were filtered again by removing those within 
10 bp of INDELs. As co and A statistics require haplotype in- 
formation of each chromosome, we used BEAGLE (37) to infer 
the haplotype phase and impute missing alleles for the entire 
set of cattle populations simultaneously. 

Genome STRiP (18) was used for deletion discovery and 
genotyping of structural variants in the population using the re- 
peat masked genome. Initial genotype likelihoods were de- 
rived with a Bayesian model. Annotation information was ob- 
tained from Ensembl 68 (UMD 3.1) (38) and SnpEff 3.0 (39). 

Statistical analysis 

Cattle genomes were divided into bins of 1 Mb, and footprints 
of positive selection were calculated for each bin with a grid 
size of 1000 using LD-based statistics with OmegaPlus (29, 30) 
and SFS-based statistics with SweepFinder (31). The cutoff for co 
statistics from OmegaPlus and A statistics from SweepFinder 
was the 99.9% quartile of empirical distribution of each 
statistic. Combination statistics using A statistics and co statistics 
were calculated . To check the correlation between distribution 
of variants and that of repeat elements, repeat element in- 
formation was obtained from UCSC with RepeatMasker 
Open-3.0 (40) using bovine genome UMD 3.1. The number of 
variants and each of the repeat elements were counted in each 
1 Mb bin region. The correlations were calculated among 
them, and they were drawn using the corrplot R. package (41). 
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